Modeling scientific-citation patterns and other triangle-rich acyclic networks
We propose a model of the evolution of networks of scientific citations. The model takes
an out-degree distribution (the distribution of the number of citations made by each paper)
and two parameters as input. The parameters capture the two main ingredients of the model:
the aging of the relevance of papers and the formation of triangles when new papers cite
old ones. We compare our model with three network-structural quantities of an empirical
citation network. We find a unique point in parameter space that optimizes the match
between the real and model data for all quantities. The optimal parameter values suggest
that the impact of scientific papers, at least in the empirical data set we model, is
proportional to the inverse of the number of papers that have appeared since they were
published.

PACS numbers: 89.65.-s, 89.75.-k 



I. INTRODUCTION 

The boom of network studies of the last decade [1, 2]
potentially has an impact on the structure of science itself.
Network measures can help create better bibliometric
quantities to evaluate scientific impact [3] and the
sociological aspects of scientific collaboration and the
exchange of ideas. Indeed, the study of scientific citations
has become a subfield of complex network studies.

One typical feature of academic citation networks is
that the number of citations to a paper decreases with
its age. Inspired by this point, many works have focused
on how a paper's age influences its ability to attract
new citations [10, 11, 13] (or, equally, new attachments
in the network). Specifically, it is believed that the
attachment rate (the rate of new citations to an old paper)
depends on both its current number of citations (its
in-degree in the network) and its age. (Here we consider
citations as going back in time, meaning that the out-degree
is the number of references and the in-degree is the number
of citations.) Another important constraint of citation
networks is that they are time ordered: of any pair of
papers, one is older than the other. (It might, in practice,
be more relevant to consider papers published almost
simultaneously as unordered, but in this work we assume that
this is a negligible effect.) An important consequence of the
time ordering is that citation networks are acyclic, i.e.,
there are no closed (directed) paths. In Fig. 1 we show a
small citation network as an example. This network consists
of the references of this paper and how they cite each other.

FIG. 1: (Color online) An example of a citation network:
the citation network of the articles cited by this article
(with the indices being the indices of the reference list).

In a recent paper [14], Karrer and Newman (KN) proposed
a random graph model for directed acyclic graphs.
In the KN model, the vertices are ordered by time and
their in- and out-degrees are pre-assigned (similar to the
undirected "configuration model" [15]). The vertices are
added to the network iteratively (from 1 to N, with N being
the network size), and for each new vertex v, arcs (directed
edges) are added from v to old vertices whose in-degree
is lower than their prescribed value, until v's out-degree
is as large as its prescribed value. Karrer and Newman
validate their model with empirical measurements and
obtain good agreement for some quantities [14], but their
model does not, as we will show, generate as many triangles
as real citation networks have. (Note that there are two
topologically different directed triangles, but only
one of them is acyclic, which makes the word "triangle"
unambiguous in this study.) In this work, we present a model
of academic citation networks that remedies the lack of
triangles in the KN model by building on mechanisms
arguably at work in the scientific process. In this paper,
we first discuss the structure of empirical citation
networks, then present the model, and finally test it against
three network-structural quantities of real citation
networks.
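
As a rough illustration of the stub matching described above for the KN model, the following Python sketch adds the vertices in order and matches each out-stub of a new vertex to a free in-stub of an older vertex. It is only a sketch under stated assumptions, not Karrer and Newman's implementation: the function name kn_random_dag, the 0-based indices, the uniform choice among free in-stubs, and the fact that repeated arcs between the same pair of vertices are not excluded are all choices made here for illustration.

```python
import random


def kn_random_dag(k_in, k_out, seed=None):
    """Sketch of ordered stub matching for a directed acyclic graph.

    k_in[v] and k_out[v] are the prescribed in- and out-degrees of vertex v,
    with vertices listed in the order they are added (0-based, an assumption).
    Returns a list of arcs (v, u) meaning "v cites the older vertex u".
    """
    rng = random.Random(seed)
    free_in_stubs = []  # one entry per unmatched in-stub of an older vertex
    arcs = []
    for v in range(len(k_in)):
        if k_out[v] > len(free_in_stubs):
            raise ValueError("degree sequences cannot be matched in this order")
        for _ in range(k_out[v]):
            # Pick a free in-stub uniformly at random and remove it (swap-pop).
            pos = rng.randrange(len(free_in_stubs))
            free_in_stubs[pos], free_in_stubs[-1] = (
                free_in_stubs[-1],
                free_in_stubs[pos],
            )
            arcs.append((v, free_in_stubs.pop()))
        # Only now do the in-stubs of v become available to later vertices.
        free_in_stubs.extend([v] * k_in[v])
    return arcs
```

Every returned arc (v, u) has u < v, so the result is acyclic by construction, and each vertex ends up with exactly its prescribed in- and out-degree whenever the two degree sums agree.
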
FIG. 2: (Color online) The number T_i of triangles as a
function of the order i in which the vertex is added. The
solid, dotted, and dashed lines correspond to the empirical
citation network of high-energy physics papers, the KN model,
and the extended KN model, respectively. The values of T_i
for the two theoretical models are averages over two hundred
independent samples.



II. EMPIRICAL MEASUREMENTS AND THE 
PREDICTIONS OF THE KN MODEL 

Before presenting our model we state the most important
motivation for this study. In Fig. 2 we show the
number T_i of triangles in an empirical citation network
consisting of N = 27,770 papers (or rather preprints)
on theoretical high-energy physics. There are in total
352,285 citations (or arcs, i.e., directed edges) among them.
The data set comes from preprints posted on arxiv.org
between 1992 and 2003. For the measurement, we define a
triangle as the pattern "paper A citing B and C, and
B citing C", and calculate the number of such patterns
present in the network when going through the papers
from 1 to N (in the order of their appearance on the
website). To reduce the computational complexity, we sample
every 200th i-value. For comparison, we also plot the
predicted number of triangles of the KN model and of a
simple extension of the KN model that introduces more
triangles: when a new vertex enters the network, rather
than randomly matching all its out-degrees with
in-degrees among the existing vertices, we first match
one out-degree randomly with an in-degree belonging
to an older vertex w (as in the KN model) and then let as
many of the remaining arcs as possible go to neighbors
of w (and after that, also to the neighbors of its new
neighbors). Note that, by the definition of the KN model,
both the network size N and the degree sequences (both in-
and out-degrees) are identical to those of the empirical data.
Both the KN model and the extension underestimate the
number of directed triangles in the real network.
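
The triangle count used here can be sketched in a few lines of Python. The sketch assumes a hypothetical input format, refs, a list in which refs[a] is the set of (0-based) indices of the older papers cited by paper a; neither this format nor the function name comes from the original study. Because the newest paper of an acyclic triangle is always A, each triangle is counted exactly once, at the step where A is added.

```python
from itertools import accumulate


def cumulative_triangles(refs):
    """Return T, where T[i] is the number of triangles among papers 0..i.

    refs[a] is the set of older papers cited by paper a (time ordered).
    A triangle is the pattern "a cites b and c, and b cites c".
    """
    closed_at = []
    for cited in refs:
        # For every cited paper b, count the papers c that both a and b cite.
        closed_at.append(sum(len(cited & refs[b]) for b in cited))
    # Running total over the order in which the papers were added.
    return list(accumulate(closed_at))
```

Sampling every 200th value, as done for the figure, then amounts to slicing the returned list, for example T[::200].
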



III. MOTIVATION AND DEFINITION OF THE
MODEL 

In this section we discuss and motivate our model.
We start by ordering the vertices temporally as in the real
data, and their out-degrees (the numbers of citations they
make) are kept the same as in the original data.
(Alternatively, the degrees can be drawn from some appropriate
distribution.) We do not restrict the in-degrees; they are
an emergent property of the model that we will use
for validation. We add the vertices one by one and fill up
the out-degree of the new vertex before adding the next one.

A common assumption is that the relevance of a paper
decays with its age [3, 6, 8, 10]. In other
words, science will move away from any paper. For this
reason, we let the first arc from a new vertex i go to
an old vertex j with a probability Π_{i→j} proportional to
its age t_j = i − j to a power α (where a negative α
reflects an attachment probability decaying with age).
To fill up the remaining out-degrees of i, we attach arcs
with probability β to random (in- or out-) neighbors of j,
and otherwise (i.e., with probability 1 − β) attach arcs to
older vertices with the probability Π_{i→j} above. If there
is no available neighbor to attach to (we assume that a
vertex cannot link to another vertex twice, or to itself),
we make an attachment of the first type. Note that the number
of candidates that i can connect to increases with more
out-degrees in the system, i.e., with time. This
triangle-formation step (proposed in Ref. [16] as a model of
scale-free networks with a tunable clustering coefficient)
is a mechanism that, we argue, fits citation networks well.
To put a scientific paper in the right context one cites
papers on the same theme; since these papers are similar
to each other they are likely to cite each other. This in
itself means that we can expect many triangles: if paper A
cites B and C, then B also cites C with a relatively large
probability, which is effectively the same as the triangle
formation sketched above. As a more explicit mechanism,
one can imagine that when working on paper A the
researchers may find paper C from the reference list of
paper B. In sum, our model has two input parameters, α
and β (in addition to the degrees), governing the two key
ingredients, aging and triangle formation.
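
The growth rule described in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the code used for the results: the names grow_citation_network, alpha (the aging exponent), and beta (the triangle-formation probability) are chosen here for illustration, vertices are 0-based, the triangle-formation step always uses the neighbors of the first target j of the new vertex, vertices that are already cited are excluded from the age-based draw, and an out-degree larger than the number of existing vertices is truncated.

```python
import random


def _pick_by_age(rng, i, taken, alpha):
    """Draw an older vertex j not yet cited, with weight (i - j) ** alpha."""
    candidates = [j for j in range(i) if j not in taken]
    weights = [(i - j) ** alpha for j in candidates]
    return rng.choices(candidates, weights=weights)[0]


def grow_citation_network(out_degrees, alpha, beta, seed=None):
    """Return refs, where refs[i] is the set of older vertices cited by i."""
    rng = random.Random(seed)
    refs = []      # out-neighbors (references) of each vertex
    cited_by = []  # in-neighbors (citing vertices) of each vertex
    for i, k_out in enumerate(out_degrees):
        targets = set()
        first = None
        wanted = min(k_out, i)  # assumption: cannot cite more papers than exist
        while len(targets) < wanted:
            choice = None
            if first is not None and rng.random() < beta:
                # Triangle formation: a random in- or out-neighbor of `first`.
                neighbors = [
                    v for v in refs[first] | cited_by[first] if v not in targets
                ]
                if neighbors:
                    choice = rng.choice(neighbors)
            if choice is None:
                # First arc, probability 1 - beta, or no neighbor available.
                choice = _pick_by_age(rng, i, targets, alpha)
            if first is None:
                first = choice
            targets.add(choice)
        refs.append(targets)
        cited_by.append(set())
        for j in targets:
            cited_by[j].add(i)
    return refs
```

With alpha = -1 the age-based weight of vertex j is 1 / (i - j), and with beta = 0 the triangle-formation step is switched off, so every arc is drawn by age alone.
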



IV. MEASURED QUANTITIES 

Following Ref. [14], for each vertex i, we define a parameter

\[ \Lambda_i = \sum_{j=1}^{i-1} k_j^{\mathrm{in}} - \sum_{j=1}^{i} k_j^{\mathrm{out}}. \qquad (1) \]

Λ_i is thus the sum of the in-degrees of the vertices that
have been added to the network before i (i.e., from vertex
1 to vertex i − 1) minus the sum of the out-degrees of the
vertices up to and including i. As pointed out in Ref. [14],
this parameter should satisfy the conditions Λ_i ≥ 0 for
i = 2, ..., N − 1 and Λ_1 = Λ_N = 0. The interpretation of
Λ_i is that it is the number of arcs connecting vertices
later than i to vertices earlier than i [14]. We will also
measure P(k_in), the probability of randomly selecting a
vertex whose in-degree is k_in, and T_i. After the networks
are constructed, we measure these three quantities and
compare them with the corresponding empirical values. The
results presented below for the models are averages over
200 independent network realizations.

FIG. 3: (Color online) Network statistics for our model. The
solid lines correspond to the citation network of high-energy
physics papers and the dashed lines represent our model data.
In (a) and (b) we use the model parameters α = −1 and
β = 0.99. (a) shows the average number T_i of triangles as
a function of the index i of the added vertex. (b) displays
the number Λ_i of arcs passing i as a function of i. (c) and
(d) correspond to (a) and (b) but for α = 0 and β = 0.99.
(e) and (f) show the same for α = β = 0. (g) and (h) also
correspond to (a) and (b) but for parameters α = −1 and
β = 0.
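
As a hedged illustration of Eq. (1), the quantity counting the arcs that pass a vertex (written Λ_i here) can be computed from running degree sums. The sketch assumes that Λ_i is the total in-degree of the vertices before i minus the total out-degree of the vertices up to and including i, which is the reading consistent with Λ_1 = Λ_N = 0. The input format refs (a list of sets of cited, older, 0-based indices) and the function name are illustrative assumptions.

```python
def arcs_passing(refs):
    """Return lam, where lam[i] counts arcs from vertices after i to vertices before i.

    refs[a] is the set of older vertices cited by vertex a (time ordered).
    """
    n = len(refs)
    k_out = [len(cited) for cited in refs]
    k_in = [0] * n
    for cited in refs:
        for j in cited:
            k_in[j] += 1

    lam = []
    in_sum = 0   # in-degrees of the vertices strictly before i
    out_sum = 0  # out-degrees of the vertices up to and including i
    for i in range(n):
        out_sum += k_out[i]
        lam.append(in_sum - out_sum)
        in_sum += k_in[i]
    return lam
```

For the four-vertex example refs = [set(), {0}, {0, 1}, {1, 2}] the function returns [0, 1, 1, 0]: one arc (from vertex 2 to vertex 0) passes vertex 1, and one arc (from vertex 3 to vertex 1) passes vertex 2.
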



V. RESULTS 

Now we turn to the numerical results for our model.
We first investigate the model's dependence on the
parameters α and β and compare the values of T_i and Λ_i to
the real data. By construction, large β-values give large
numbers of triangles. As seen in Fig. 3(a) there are (unlike
the results in Fig. 2) parameters giving a number of
triangles that matches the empirical curves. A negative
α-value is important, not only to get T_i-values matching
the empirical data, but also to obtain matching Λ_i-values
(Fig. 3(b)). We have scanned the region of α ∈ [−2, 0]
and β ∈ [0, 1], and found that the combination α = −1
and β = 0.99 gives the best fit to the empirical data [17].
To give an overview of the model's behavior we plot three
other combinations of α- and β-values in Fig. 3. In Fig. 3
(c) and (d) we show the results for α = 0 and β = 0.99.
When α = 0 the chance of acquiring new arcs is independent
of age. The chance of reaching a vertex with a
triangle-formation step is proportional to the degree of
the vertex, leading to a preferential attachment (an
attachment probability increasing with degree) for high β
and low α. (Note that the first network model with
preferential attachment was a model of citation networks [4].)
Fig. 3(c) shows that even though β is nearly maximal,
the number of triangles is not as large as in the empirical
data. The reason for this is that there are more successful
triangle-formation steps (or, equally, that it is less
probable to attach to a vertex with lower total degree
than the desired total degree of the new vertex) for
negative α. In Fig. 3(e) and (f) we present the results
for α = 0 and β = 0. In this case, both the aging effect
and the clustering effect are absent. Not surprisingly,
neither T_i nor Λ_i matches the real data. Even though the
arcs reach further back in time in this case, the number of
arcs passing i (i.e., Λ_i) is lower. The data for α = −1 and
β = 0 are plotted in Fig. 3(g) and (h). We note that in
the absence of the triangle-formation step, not only the
number of triangles but also Λ_i is underestimated. As
a final comment on Fig. 3, the cusps around i = 21,000
are due to a change in the raw data, where the sampled
database was split into different categories and the
sampled papers after this point cite, on average, fewer other
papers.

FIG. 4: (Color online) The in-degree distribution P(k_in) for
our four parameter combinations (indicated in the panels).

Our third quantity is the in-degree distribution, which
we plot in Fig. 4. Both curves with β = 0.99 fit the
real distribution well. As mentioned, there is an effective
(though not necessarily linear) preferential attachment in
this case, which explains the broad distributions. Λ_i puts
strong constraints on the in-degree distribution: if both
Λ_i and the out-degree distribution were fixed to the
observed data (not only the out-degree distribution as in
our case), then the in-degree distribution would be the same
as in the observed data. With low β-values, the in-degree
distribution becomes much narrower than the empirical
one. Combining Figs. 3 and 4, we note that although an
appropriately large value of β can generate networks with
an in-degree distribution fitting the empirical data, a
model lacking the aging effect fails to capture the evolution
of the citation network of scientific papers. Taking all these
observations into account, both aging and triangle
formation seem to be important mechanisms in the citation
network.
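
The in-degree distribution compared here can be estimated from a network in a few lines. The following Python sketch assumes the same hypothetical input format as before (refs, a list in which refs[a] is the set of 0-based indices cited by vertex a); the function name is likewise an illustrative choice.

```python
from collections import Counter


def in_degree_distribution(refs):
    """Return {k: P(k)}, the fraction of vertices with in-degree k.

    refs[a] is the set of older vertices cited by vertex a.
    """
    n = len(refs)
    k_in = [0] * n
    for cited in refs:
        for j in cited:
            k_in[j] += 1
    counts = Counter(k_in)
    # Normalize the counts so that the probabilities sum to one.
    return {k: counts[k] / n for k in sorted(counts)}
```

For a model comparison, the distribution would be averaged over many independent realizations, as is done for the other quantities.
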



VI. CONCLUSIONS 

We have proposed a random, evolving network model
for scientific paper citations. In our model, the
attractiveness of a vertex (paper) decays with its age as a
power law with exponent α, and another parameter, β,
determines the number of triangle formations (when a new
paper cites two papers where one cites the other). We
compared our proposed model with an empirical citation
network of high-energy physics preprints posted at arxiv.org.
The out-degree distribution is an input to our model; in this
paper we take it from empirical data. We use three quantities
to validate our model: the number of triangles, the number
of arcs passing a vertex, and the in-degree distribution.
All these quantities are best modeled for the parameter
values α = −1 and β = 0.99 [17]. From these observations,
our model suggests that in citation networks of scientific
papers, the probability that a paper attracts new citations
is approximately inversely proportional to its age (measured
by its position in the sequence of publication) and that
there is a strong tendency to cite pairs of papers where one
paper cites the other. For the future, we believe it would
be informative, as a complement to generative models like
the present one, to study the mechanisms of citation through
interview studies and questionnaires directed at researchers.